search image
4ea14e6090343523ddcd5d3ca449695f-Paper-Datasets_and_Benchmarks.pdf
Thus, there is a need for a reference point, on which each model can be tested and from where potential improvements can be derived. In this study, we select publicly available state-of-the-art visual search models and datasets in natural scenes, and provide a common framework for their evaluation. To this end, we apply a unified format and criteria, bridging the gaps between them, and we estimate the models' efficiency and similarity with humans using a specific set of metrics.
- South America > Argentina (0.06)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Information Management (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
ViSioNS: Visual Search in Natural Scenes Benchmark
Thus, there is a need for a reference point, on which each model can be tested and from where potential improvements can be derived. In this study, we select publicly available state-of-the-art visual search models and datasets in natural scenes, and provide a common framework for their evaluation. To this end, we apply a unified format and criteria, bridging the gaps between them, and we estimate the models' efficiency and similarity with humans using a specific set of metrics.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Information Management (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
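The ViSioNS abstract above refers to estimating models' efficiency and similarity with humans using "a specific set of metrics" without listing them here. As an illustration of one common search-efficiency measure, the NumPy sketch below computes the cumulative fraction of trials in which the target is fixated within the first k fixations; the input format (per-trial rank of the target-finding fixation, or None) is our assumption, not the benchmark's actual data schema.

```python
import numpy as np

def cumulative_performance(found_ranks, max_fixations=10):
    """Fraction of trials in which the target was fixated within the
    first k fixations, for k = 1..max_fixations.

    found_ranks : list of 1-indexed ranks of the fixation that first
    landed on the target, or None if the target was never found.
    (Illustrative input format, not the benchmark's actual schema.)
    """
    counts = np.zeros(max_fixations)
    for rank in found_ranks:
        if rank is not None and rank <= max_fixations:
            counts[rank - 1:] += 1  # found at rank k => found for all k' >= k
    return counts / len(found_ranks)

# Example: in 4 trials the target was found at fixations 2, 5, 1, and never.
print(cumulative_performance([2, 5, 1, None], max_fixations=6))
# -> [0.25 0.5  0.5  0.5  0.75 0.75]
```

Curves like this, computed per model and for human observers, allow a direct efficiency comparison at each fixation budget.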
Efficient Zero-shot Visual Search via Target and Context-aware Transformer
Ding, Zhiwei; Ren, Xuezhe; David, Erwan; Vo, Melissa; Kreiman, Gabriel; Zhang, Mengmi
Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Humans rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is of critical importance for locating a target object in complex scenes, as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self-attention in the Vision Transformer with target- and context-relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments with TCT and other competitive visual search models on three natural scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and beats the SOTA models in challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.
- North America > United States > Texas > Harris County > Houston (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
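The TCT abstract above describes target modulation as patch-wise local relevance between the target and search images, applied to self-attention in a Vision Transformer. The NumPy sketch below illustrates that idea under our own assumptions: relevance is the cosine similarity between the target embedding and each patch embedding, added as a bias to one head's attention logits. The function name, the additive-bias form, and the `alpha` scale are illustrative; the paper's exact modulation rule may differ.

```python
import numpy as np

def target_modulated_attention(patch_feats, target_feat, attn_logits, alpha=1.0):
    """Bias self-attention logits toward patches that resemble the target.

    patch_feats : (N, D) embeddings of the N search-image patches
    target_feat : (D,)  embedding of the target image
    attn_logits : (N, N) raw attention logits from one transformer head

    Returns modulated attention weights of shape (N, N). The additive
    bias and `alpha` are illustrative assumptions, not the paper's rule.
    """
    # Patch-wise relevance: cosine similarity of each patch to the target.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = target_feat / np.linalg.norm(target_feat)
    relevance = p @ t                                  # (N,)

    # Add each *key* patch's relevance to the logits, then softmax rows.
    logits = attn_logits + alpha * relevance[None, :]
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

A global contextual term could be folded in the same way, as an additional bias computed from scene-level features rather than per-patch similarity.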
Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks
Samiei, Manoosh; Clark, James J.
Most studies in computational modeling of visual attention address task-free observation of images, yet free-viewing saliency covers only a limited range of daily-life scenarios. Most visual activities are goal-oriented and demand substantial top-down control of attention; visual search, in particular, demands more top-down control than free-viewing. In this paper, we present two approaches to model the visual attention and distraction of observers during visual search. Our first approach adapts a lightweight free-viewing saliency model to predict eye fixation density maps of human observers over the pixels of search images, using a two-stream convolutional encoder-decoder network trained and evaluated on the COCO-Search18 dataset. This method predicts which locations are more distracting when searching for a particular target. Our network achieves good results on standard saliency metrics (AUC-Judd=0.95, AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59). Our second approach is object-based and predicts the distractor and target objects during visual search. Distractors are all objects except the target that observers fixate on during search. This method uses a Mask-RCNN segmentation network pre-trained on MS-COCO and fine-tuned on the COCO-Search18 dataset. We release our segmentation annotations of targets and distractors in COCO-Search18 for three target categories: bottle, bowl, and car. The average scores over the three categories are: F1-score=0.64, MAP(IoU:0.5)=0.57, MAR(IoU:0.5)=0.73. Our implementation code in TensorFlow is publicly available at https://github.com/ManooshSamiei/Distraction-Visual-Search.
- North America > Canada > Quebec > Montreal (0.28)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Wisconsin (0.04)
- (2 more...)
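Among the saliency metrics reported in the abstract above, NSS (Normalized Scanpath Saliency) has a simple closed form: z-score the predicted saliency map, then average it at human fixation locations. A minimal NumPy sketch (variable names are ours):

```python
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    at human fixation locations.

    saliency_map : (H, W) predicted saliency map
    fixation_map : (H, W) binary map, 1 at fixated pixels
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return s[fixation_map.astype(bool)].mean()

# Toy 3x3 map with one fixation at the center, where the prediction peaks:
sal = np.array([[0.1, 0.2, 0.1],
                [0.2, 0.9, 0.2],
                [0.1, 0.2, 0.1]])
fix = np.zeros_like(sal)
fix[1, 1] = 1
print(round(float(nss(sal, fix)), 2))  # -> 2.77
```

Higher NSS means the predicted map assigns above-average saliency to the pixels humans actually fixated; a chance-level prediction scores near 0.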